Skip to content

[Directory] Strengthen single activation guarantees even further.#9958

Open
ledjon-behluli wants to merge 8 commits intodotnet:mainfrom
ledjon-behluli:grain_leases
Open

[Directory] Strengthen single activation guarantees even further.#9958
ledjon-behluli wants to merge 8 commits intodotnet:mainfrom
ledjon-behluli:grain_leases

Conversation

@ledjon-behluli
Copy link
Copy Markdown
Contributor

@ledjon-behluli ledjon-behluli commented Mar 1, 2026

This PR adds safety lease holds to the new strongly-consistent grain directory to prevent split-brain scenarios after silo crashes, utilizing both specific grain leases (tombstones) and broader range leases when snapshot transfers fail. Under normal graceful shutdowns the new directory performs handovers, so no range lease holds are applied.

Users can specify the lease duration, otherwise if duration <= zero than it is computed as: 2 × ProbeTimeout × NumMissedProbesLimit which should provide a good buffer so that a network-partitioned silo has enough time to realize its isolated and shut itself down before the rest of the cluster allows those grains to reactivate elsewhere.

The impact on normal registrations in a healthy cluster should be practically unobservable.

Note, this wont work with the eventually-consistent directory!

Microsoft Reviewers: Open in CodeFlow

@ledjon-behluli
Copy link
Copy Markdown
Contributor Author

@ReubenBond addressed the feedback

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens the strongly-consistent distributed grain directory’s single-activation guarantees by introducing “safety lease holds” after ungraceful silo failures and during recovery scenarios, delaying re-registrations to reduce split-brain risk.

Changes:

  • Add silo-level and range-level lease hold tracking to directory partitions, including periodic cleanup of expired leases.
  • Enforce lease holds during RegisterAsync by rejecting registrations with a new DirectoryLeaseHoldException.
  • Add configuration (SafetyLeaseHoldDuration) and a new test validating reactivation blocking after an ungraceful shutdown.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
test/Orleans.GrainDirectory.Tests/Orleans.GrainDirectory.Tests.csproj Adds Microsoft.Extensions.TimeProvider.Testing package for FakeTimeProvider in tests.
test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs New test exercising lease-hold behavior after KillSiloAsync.
src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.cs Tracks/creates lease holds, applies holds on silo death, and adds cleanup logic + logging.
src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.Interface.cs Enforces lease holds on registration by throwing DirectoryLeaseHoldException.
src/Orleans.Runtime/GrainDirectory/DistributedGrainDirectory.cs Computes effective lease duration, wires TimeProvider, and runs periodic lease cleanup requests.
src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs Adds SafetyLeaseHoldDuration option and documentation; switches to file-scoped namespace.
src/Orleans.Core.Abstractions/Exceptions/DirectoryLeaseHoldException.cs Introduces a new exception type for lease-hold rejections.
Comments suppressed due to low confidence (2)

src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs:88

  • Typo in XML doc comment: "clcks skues" should be "clock skews".
    /// Conditional deregistration is used for lazy clean-up of activations whose prompt deregistration failed for some reason (e.g., message failure).
    /// This should always be at least one minute, since we compare the times on the directory partition, so message delays and clcks skues have
    /// to be allowed.

test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs:65

  • Grammar in comment: "its the only one alive" should be "it's the only one alive".
            // Time has expired, we can place it now, and it will be the primary as its the only one alive.
            Assert.Equal(await leaseGrain.GetAddress(), primary.SiloAddress);

Comment on lines +27 to +28
// The value doesnt matter we will advance time manually.
public static readonly TimeSpan LeaseHoldDuration = TimeSpan.FromSeconds(5);
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling in comment: "doesnt" should be "doesn't".

This issue also appears on line 64 of the same file.

Copilot uses AI. Check for mistakes.
{
if (_leaseHoldDuration > TimeSpan.Zero)
{
// We pessimistically asssume if snapshot transfer failed, than safety is needed.
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling/grammar in comments: "asssume" -> "assume" and "than" -> "then".

Suggested change
// We pessimistically asssume if snapshot transfer failed, than safety is needed.
// We pessimistically assume if snapshot transfer failed, then safety is needed.

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +45
if (lockedRange.Contains(rangeHash))
{
// We reject, the client should retry!
throw new DirectoryLeaseHoldException($"Range {lockedRange} is under a lease hold until {expiration - utcNow}.");
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception messages use "until {expiration - utcNow}" but {expiration - utcNow} is a remaining duration, not an absolute time. Consider changing the wording to "for {remaining}" or include the absolute expiration timestamp to avoid confusion when diagnosing retries.

Copilot uses AI. Check for mistakes.
{
public class GrainDirectoryOptions
/// <summary>
/// Configuration type that controls the type of the grain directory caching algorithm that silo use.
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in XML doc comment: "silo use" should be "silos use".

This issue also appears on line 86 of the same file.

Suggested change
/// Configuration type that controls the type of the grain directory caching algorithm that silo use.
/// Configuration type that controls the type of the grain directory caching algorithm that silos use.

Copilot uses AI. Check for mistakes.
Comment on lines +494 to +500
if (_leaseHoldDuration > TimeSpan.Zero)
{
// We pessimistically asssume if snapshot transfer failed, than safety is needed.
var expiration = _timeProvider.GetUtcNow().UtcDateTime.Add(_leaseHoldDuration);
_rangeLeaseHolds.Add((addedRange, expiration));
LogWarningLeaseHoldForRange(_logger, addedRange, expiration);
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new range-lease-hold behavior (adding _rangeLeaseHolds on snapshot transfer failure) doesn’t appear to have targeted test coverage. Consider adding tests which (a) force a snapshot transfer failure and assert registrations in that range throw DirectoryLeaseHoldException until expiry, and (b) verify a graceful shutdown path does not apply lease holds.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +825 to +830
foreach (var kvp in _directory)
{
if (expiredSilos.Contains(kvp.Value.SiloAddress!))
{
if (_directory.Remove(kvp.Key))
{
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CleanupExpiredLeasesCore iterates _directory using foreach while calling _directory.Remove(...) inside the loop. Removing entries from a Dictionary during enumeration will throw InvalidOperationException, so lease cleanup will repeatedly fail (and only be caught/logged). Collect the keys to remove first (or use ToArray()/List<GrainId>), then remove them in a separate loop; also consider using a HashSet<SiloAddress> for expiredSilos to avoid repeated Contains scans.

Copilot uses AI. Check for mistakes.
Comment on lines +258 to +261
// If it was ShuttingDown, it surrendered its ownership gracefully.
// If it was Active (or Joining) and suddenly became Dead, it crashed.

if (previousStatus is not SiloStatus.ShuttingDown && _leaseHoldDuration > TimeSpan.Zero)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graceful shutdowns typically transition ShuttingDown -> Stopping -> Dead. This check only treats previousStatus == ShuttingDown as graceful, so a gracefully-stopping silo will likely have previousStatus == Stopping when the Dead update arrives and will incorrectly get a safety lease hold applied. Consider treating Stopping (and possibly any previousStatus.IsTerminating()) as graceful to match the intent in the comment/PR description.

Suggested change
// If it was ShuttingDown, it surrendered its ownership gracefully.
// If it was Active (or Joining) and suddenly became Dead, it crashed.
if (previousStatus is not SiloStatus.ShuttingDown && _leaseHoldDuration > TimeSpan.Zero)
// If it was ShuttingDown or Stopping, it surrendered its ownership gracefully.
// If it was Active (or Joining) and suddenly became Dead, it crashed.
var isGracefulShutdown = previousStatus is SiloStatus.ShuttingDown or SiloStatus.Stopping;
if (!isGracefulShutdown && _leaseHoldDuration > TimeSpan.Zero)

Copilot uses AI. Check for mistakes.
Comment on lines +98 to +103
_leaseHoldDuration = directoryOptions.Value.SafetyLeaseHoldDuration switch
{
null => 2 * membershipOptions.Value.ProbeTimeout * membershipOptions.Value.NumMissedProbesLimit,
TimeSpan duration when duration >= TimeSpan.Zero => duration,
_ => throw new InvalidOperationException("Lease hold duration must be non-negative.")
};
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says lease duration is computed when the configured duration is <= 0, but the implementation here computes the duration only when SafetyLeaseHoldDuration is null and treats TimeSpan.Zero as an explicit "disable leases" value (and throws for negatives). Either update the PR description (and/or option docs) to match this behavior, or adjust the implementation to match the described <= 0 semantics.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@ReubenBond ReubenBond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some more tests. I'll ask copilot to add some more coverage.

ReubenBond and others added 3 commits March 26, 2026 10:11
…lookup, and multi-grain scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a safety lease hold is active (e.g., after an ungraceful silo crash or
during cluster startup with non-contiguous view changes), the partition
returns DirectoryResult.RetryAfter(remainingDuration) instead of throwing.
InvokeAsync checks RetryAfterDelay and waits the suggested duration before
retrying, keeping the retry inside the directory layer.

Previously, throwing DirectoryLeaseHoldException would propagate to the
activation layer, which deactivates with DirectoryFailure and forwards the
message to the same silo. Because DirectoryFailure is treated as a transient
error, the message ForwardCount is decremented on each cycle, preventing the
hop limit from being reached and creating an infinite zero-delay forwarding
loop that hangs the system.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return RetryAfter from RegisterAsync instead of throwing, so InvokeAsync
waits the exact remaining lease duration using the injected TimeProvider.
This removes the DirectoryLeaseHoldException class entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ReubenBond
Copy link
Copy Markdown
Member

@ledjon-behluli I made a minor change to not throw when a lease is still held. Now, the directory returns a 'retry after delay' result instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants